The below Rmarkdown will guide you through the steps that generated the graphs used in the publications figures.
NOTE: all the scripts that are being sourced below, are present inside the figures/R folder in the publication’s GitHub repo.
source(file = "./R/load_packages.R")
source(file = "./R/load_data.R")
source(file = "./R/preprocess_data.R")
Here we prepare the data to calculate SA. Note that the below code will produce the selected graphs. If you want to have a look at the exploration before the selection then refer to the figure_SA.R script inside the figures/R folder in the publication’s GitHub repo.
NOTE: during the normalisation using Seurat, we selected the 2000 most variable genes (as per Seurat’s default option).
### Get spot names ----
nb_names <- polygons$Barcode
### Get counts for the 2000 most variable genes ----
### The counts are normalised
### It is important at the last step to order the rows as the vector of names above
SA_counts_Nor <- obs_Nor %>%
filter(rownames(.) %in% var_genes) %>%
t() %>%
as.data.frame() %>%
.[nb_names,]
### Find neighbours ----
neighbours_knn <- knn2nb(knearneigh(polygons$geom_cntd, k = 6),
row.names = nb_names)
### Calculate neighbour weights ----
neighbours_w_exp <- nb2listwdist(neighbours_knn, polygons$geom_cntd,
type = "idw", style = "raw", alpha = 2,
zero.policy = TRUE)
### Set plot output prefix ----
prfx <- "countNorm"
### Set active dataset ----
data <- SA_counts_Nor
After exploring the genes for their global Moran’s I value we identified the below two genes:
ENSG00000197971: as the gene with the highest positive Moran’s I (meaning high positive SA)
## For the selected gene
SA_genes <- "ENSG00000197971"
### Prepare the data ----
gCounts <- data[,SA_genes]
### Calculate Moran's I ----
names(gCounts) <- rownames(data)
moran.multi.local <- localmoran(gCounts,
listw = neighbours_w_exp,
spChk = TRUE)
rm(gCounts)
Moran-plot
Moran Stats table
## a flextable object.
## col_keys: `Test type`, `Moran's I`, `P-value`, `Expected`, `Variance`
## header has 1 row(s)
## body has 2 row(s)
## original dataset sample:
## Test type Moran's I P-value Expected Variance
## 1 Z-score 0.8006 <0.00001 -0.000277 9.3e-05
## 2 MC perm 0.8006 0.001 N/A N/A
Normalised gene expression maps
NOTE: the counts are Seurat-normalised and represented as log2-transformed on the map. The transformation took place as log2(x + 1).
## Joining with `by = join_by(Barcode)`
The local Moran plots are generated only for the ENSG00000197971 gene with positive SA.
Prepare the data
### Prepare the data ----
quadr <- attr(moran.multi.local, "quadr")
moran.local.map <- moran.multi.local %>%
as.data.frame() %>%
rownames_to_column(var = "Barcode") %>%
dplyr::rename("p.value" = `Pr(z != E(Ii))`) %>%
left_join(polygons[,"Barcode"]) %>%
mutate(Quadrant = quadr$pysal,
colours = Quadrant,
colours = as.factor(case_when(p.value > 0.05 ~ "#ffffff",
Quadrant == "Low-Low" ~ "#0066ff",
Quadrant == "High-Low" ~ "#ffb3b3",
Quadrant == "Low-High" ~ "#99c2ff",
Quadrant == "High-High" ~ "#ff0000")),
label = as.factor(case_when(p.value > 0.05 ~ "Not signif.",
Quadrant == "Low-Low" ~ "Low-Low",
Quadrant == "High-Low" ~ "High-Low",
Quadrant == "Low-High" ~ "Low-High",
Quadrant == "High-High" ~ "High-High")),
signif = as.factor(case_when(p.value < 0.05 ~ "Signif.",
p.value >= 0.05 ~ "Not Signif.")),
b_geom = if_else(signif == "Signif.",
geom_pol,
NA))
Local Moran Map
Local Moran clusters
Here we randomised the gene expression of gene ENSG00000197971 over space to generate a negative control and showcase the importance of space in spatial data.
Prepare the data
g = "ENSG00000197971"
order <- sample(1:nrow(data), size = nrow(data), replace = FALSE)
N_control <- data %>%
dplyr::select(all_of(g)) %>%
rownames_to_column(var = "Barcode") %>%
mutate(Barcode = Barcode[order]) %>%
left_join(polygons[, c("Barcode", "geom_cntd")])
rownames(N_control) <- N_control$Barcode
neighbours_knn_N_control <- knn2nb(knearneigh(N_control$geom_cntd, k = 6),
row.names = rownames(N_control))
neighbours_N_Control <- nb2listwdist(neighbours_knn_N_control, N_control$geom_cntd,
type = "idw", style = "raw", alpha = 2,
zero.policy = TRUE)
Normalised gene expression map
NOTE: the counts are Seurat-normalised and represented as log2-transformed on the map. The transformation took place as log2(x + 1).
Moran Stats table
## a flextable object.
## col_keys: `Test type`, `Moran's I`, `P-value`, `Expected`, `Variance`
## header has 1 row(s)
## body has 2 row(s)
## original dataset sample:
## Test type Moran's I P-value Expected Variance
## 1 Z-score 0.002 0.405998 -0.000277 9.3e-05
## 2 MC perm 0.002 0.411 N/A N/A
Local Moran
## For the selected gene
SA_genes <- "ENSG00000197971"
### Prepare the data ----
gCounts <- N_control[,SA_genes]
### Calculate Moran's I ----
names(gCounts) <- N_control$Barcode
moran.multi.local <- localmoran(gCounts,
listw = neighbours_N_Control,
spChk = TRUE)
rm(gCounts)
Here we prepare the data to calculate MAUP. Note that the below code will produce the selected graphs. If you want to have a look at the exploration before the selection then refer to the figure_MAUP.R script inside the figures/R folder in the publication’s GitHub repo.
NOTE 1: here we selected to work with
Seurat-Normalised counts.
NOTE 2: we filtered out genes that were expressed in
less than 1000 spots.
### Get counts to be used ----
select <- SA_counts_Nor %>%
t() %>%
as.data.frame() %>%
filter(rowSums(. != 0) > 1000) %>%
rownames(.)
cor.input <- SA_counts_Nor %>%
dplyr::select(all_of(select))
### Set plot output prefix ----
prfx <- "countNorm"
### Get a set of colours ----
## Colourblind-friendly colours using cols4all
## c4a palette: c4a.bu_br_bivs
cols_grid <- c("#4992C8", "#AECBEA")
## c4a palette: misc.okabe
cols_lay <- c("#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00",
"#CC79A7")
To showcase the change in the correlation statistic as aggregation changes we selected genes that had a low Spearman R. More specifically the selection pool consisted of genes that had 0.3 < |R| < 0.5.
A set of 10 genes pair were selected in random using the sample function from R. From these 10 pairs the below was chosen again in random:
| gene_X | gene_Y | R |
|---|---|---|
| ENSG00000104888 | ENSG00000132639 | 0.3635024 |
## Spearman correlations between all genes
cor.multi <- cor(cor.input, cor.input, method = "spearman")
diag(cor.multi) <- 0 #set diagonal to zero
g1 <- seq_len(nrow(cor.multi))
g2 <- max.col(abs(cor.multi))
cor.multi.Rs <- cor.multi %>%
as.data.frame() %>%
mutate(gene_X = rownames(.),
gene_Y = colnames(.)[g2],
R = .[cbind(seq_along(g1), g2)]) %>%
arrange(desc(abs(R))) %>%
select(gene_X, gene_Y, R) %>%
filter(R < 0.5 & R > 0.3)
genePairIdx <- 45
rm(g1, g2)
We showcase the MAUP in two different zone types.
1. A grid.
2. The tissue regions.
For the grid we need a 361-groups grid. This is so that we reduce the number of observations by one order of magnitude from the 3610 individual spots we have initially in the data set. To find how many breaks we should have on each side of the tissue we calculate: sqrt(361) = 19 so we need roughly 19 groups on the X and 19 on the Y axis. The X has a length of 115 and the Y of 76. Therefore, 115/19 = 6.05 (so we need to break every 6 spots) and 76/19 = 4 (so we need to break every 4 spots).
NOTE: because the tissue is not a perfect square the groups are eventually going to be less than 361.
### Prepare the data ----
## Get the range of the locations
rangeX <- range(meta_D[meta_D[, "Section"] == 1, "Spot_X"])
rangeY <- range(meta_D[meta_D[, "Section"] == 1, "Spot_Y"])
## Store the information in columns denoting groups.
dt_MAUP <- polygons %>%
left_join(meta_D[,c("Barcode", "Spot_X", "Spot_Y")], by = "Barcode") %>%
mutate(group_X = as.numeric(cut(.data[["Spot_X"]],
breaks = seq(rangeX[1], rangeX[2], by = 6),
include.lowest = TRUE)),
group_Y = as.numeric(cut(.data[["Spot_Y"]],
breaks = seq(rangeY[1], rangeY[2], by = 4),
include.lowest = TRUE)),
group_361 = factor(paste(group_X, group_Y, sep = "_")))
## The grid-like grouping will have a two-colour scheme
cols <- rep(cols_grid, 181)
colours <- expand.grid(group_X = seq(1:19),
group_Y = seq(1:19)) %>%
mutate(group_361 = factor(paste(group_X, group_Y, sep = "_"))) %>%
mutate(cols = cols[1:nrow(.)]) %>%
dplyr::select(cols, group_361)
dt_MAUP <- dt_MAUP %>%
left_join(colours)
As a summary statistic we selected the mean expression because the mean is not affected by the size of each area. On the other hand, the sum is affected by how big is the area (If a gene is almost uniformly expressed throughout, then: number of spots in area goes UP –> gene expression goes UP –> correlation goes UP).
### Prepare the data ----
## Add zones to the correlation input df
cor.input <- cor.input %>%
rownames_to_column(var = "Barcode") %>%
left_join(dt_MAUP[,c("Barcode", "group_361", "layer")]) %>%
st_drop_geometry() %>%
dplyr::select(matches("[^geom_pol]")) %>%
column_to_rownames(var = "Barcode")
## Generate the grouped correlation inputs
groups <- c("group_361", "layer")
for (g in groups) {
cIn <- cor.input %>% # get the correlation input
dplyr::select(matches(paste0("ENSG|", g))) %>% # select one group type
group_by_at(vars(g)) %>% # group by group type and gene
summarise(across(matches("ENSG"), tibble::lst(mean))) # summarise per group per column
out <- cIn %>%
dplyr::select(ends_with("_mean")) %>%
ungroup() %>%
select_all(~gsub("_mean", "", .))
assign(paste0("cor.input_", g, "_mean"), out)
rm(g, cIn, out)
}
## Working on: cor.input_group_361_mean
## Working on: cor.input_layer_mean
This map will be used for the correlations.
Here we prepare the data to calculate SH Note that the below code will produce the selected graphs. If you want to have a look at the exploration before the selection then refer to the figure_SH.R script inside the figures/R folder in the publication’s GitHub repo.
NOTE 1: The list of 2000 genes from Seurat comes after variance stabilising and selecting the most variable genes. We want to narrow this further down and get a pool of genes that are variable and expressed in more than 1000 locations on the tissue.
NOTE 2: After filtering the pool of genes was reduced to 435 genes.
### Get Seurat-normalised data ----
data <- SA_counts_Nor
### Select a set of genes to test ----
## The list of 2000 genes from Seurat comes after variance stabilising and
## selecting the most variable genes. We want to narrow this further down and
## get a pool of genes that are variable and expressed in more than 2/3 of the
## tissue area.
#### Filter for number of locations expressed ----
SH_genes <- data %>%
t() %>%
as.data.frame() %>%
filter(rowSums(. != 0) > 1000) %>%
rownames(.)
### Add geometries and convert to sp format ----
dt_SH <- data %>%
dplyr::select(all_of(SH_genes)) %>%
rownames_to_column(var = "Barcode") %>% # barcodes from row names to column
left_join(polygons[,c("geom_pol", "Barcode")]) %>% # merge with geometries
column_to_rownames(var = "Barcode") %>% # barcodes back to row names
st_as_sf(., sf_column_name = "geom_pol") %>% # data frame to sf
as(., "Spatial") # sf to sp - GWmodel still works with sp objects
## Joining with `by = join_by(Barcode)`
### Set plot output prefix ----
prfx <- "countNorm"
## Housekeeping
rm(data)
Here we run a linear regression model for all possible combinations for the 435 selected genes, excluding self-pairs.
After running the regressions, we sorted the pairs based on their R2 and selected 26 pairs with R2 > 0.4.
These 26 pairs were fed to the GWR model.
From the initial 26 pairs, the one below was selected to showcase SH.
### Determine the kernel bandwidth ----
bw <- bw.gwr("ENSG00000091513~ENSG00000197971",
approach = "AIC",
adaptive = T,
data = dt_SH)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Adaptive bandwidth (number of nearest neighbours): 2238 AICc value: 8302.722
## Adaptive bandwidth (number of nearest neighbours): 1391 AICc value: 8202.582
## Adaptive bandwidth (number of nearest neighbours): 866 AICc value: 8139.5
## Adaptive bandwidth (number of nearest neighbours): 543 AICc value: 8073.116
## Adaptive bandwidth (number of nearest neighbours): 342 AICc value: 8026.095
## Adaptive bandwidth (number of nearest neighbours): 219 AICc value: 8005.941
## Adaptive bandwidth (number of nearest neighbours): 141 AICc value: 8010.028
## Adaptive bandwidth (number of nearest neighbours): 265 AICc value: 8011.159
## Adaptive bandwidth (number of nearest neighbours): 188 AICc value: 8003.862
## Adaptive bandwidth (number of nearest neighbours): 171 AICc value: 8005.641
## Adaptive bandwidth (number of nearest neighbours): 200 AICc value: 8004.689
## Adaptive bandwidth (number of nearest neighbours): 181 AICc value: 8003.765
## Adaptive bandwidth (number of nearest neighbours): 176 AICc value: 8004.691
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 8003.338
## Adaptive bandwidth (number of nearest neighbours): 185 AICc value: 8003.593
## Adaptive bandwidth (number of nearest neighbours): 182 AICc value: 8003.637
## Adaptive bandwidth (number of nearest neighbours): 184 AICc value: 8003.728
## Adaptive bandwidth (number of nearest neighbours): 182 AICc value: 8003.637
## Adaptive bandwidth (number of nearest neighbours): 183 AICc value: 8003.338
## The bandwidth is an adaptive bandwidth (kernel size) indicating that its size
## varies but the nearest 183 observations will be used to compute each local
## weighted regression.
## Here it, has a value of 183 indicating that 5% of the tissue data will be
## used in each local calculation (there are 3610 records in the data):
message("Bandwidth (bw): ", bw)
## Bandwidth (bw): 183
gwr <- gwr.basic("ENSG00000091513~ENSG00000197971",
adaptive = T,
data = dt_SH,
bw = bw)
gwr
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2023-03-15 13:41:04
## Call:
## gwr.basic(formula = "ENSG00000091513~ENSG00000197971", data = dt_SH,
## bw = bw, adaptive = T)
##
## Dependent (y) variable: NA
## Independent variables:
## Number of data points: 3610
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8116 -0.6743 0.0350 0.6262 4.5189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01147 0.03884 -26.04 <2e-16 ***
## ENSG00000197971 0.65023 0.01161 56.00 <2e-16 ***
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.8197 on 3608 degrees of freedom
## Multiple R-squared: 0.465
## Adjusted R-squared: 0.4648
## F-statistic: 3136 on 1 and 3608 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 2424.145
## Sigma(hat): 0.8196832
## AIC: 8813.13
## AICc: 8813.137
## BIC: 5246.279
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: bisquare
## Adaptive bandwidth: 183 (number of nearest neighbours)
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept -2.450513 0.020129 0.256662 0.617590 7.8805
## ENSG00000197971 -0.947430 0.013307 0.106291 0.260641 0.9704
## ************************Diagnostic information*************************
## Number of data points: 3610
## Effective number of parameters (2trace(S) - trace(S'S)): 144.4093
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 3465.591
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 8003.338
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 7890.709
## BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 5030.39
## Residual sum of squares: 1827.128
## R-square value: 0.5967473
## Adjusted R-square value: 0.5799391
##
## ***********************************************************************
## Program stops at: 2023-03-15 13:41:05
gwr.tab <- apply(gwr$SDF@data[, 1:7], 2, summary)
gwr.tab <- round(gwr.tab, 1)
t(gwr.tab[,1:2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Intercept -2.5 0 0.3 0.5 0.6 7.9
## ENSG00000197971 -0.9 0 0.1 0.2 0.3 1.0
## Housekeeping
rm(gwr.tab)